Lab 06: Linear regression

Introduction

This week's lab focuses on data modelling using linear regression. At the end of the lab, you should be able to use scikit-learn to:

  • Create a linear regression model using the least squares technique.
  • Use the model to predict new values.
  • Measure the accuracy of the model.
  • Engineer new features to optimise model performance.

Getting started

Let's start by importing the packages we need. This week, we're going to use the linear_model subpackage from scikit-learn to build linear regression models using the least squares technique. We're also going to need the dummy subpackage to create a baseline regression model that we can compare our results against.


In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The dataset is in fixed width format, but fortunately this is supported out of the box by pandas' read_fwf function:


In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                                          'acceleration', 'model year', 'origin', 'car name'])

Exploratory data analysis

According to its documentation, the Auto MPG dataset consists of eight explanatory variables (i.e. features), each describing a single car model, which are related to the target variable: the car's fuel consumption in miles per gallon (MPG). The following attribute information is given:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

Let's start by taking a quick peek at the data:


In [ ]:
df.head()

As the car name is unique for each instance (according to the dataset documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:

Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names, but for simplicity let's just drop them here.
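
For reference only (we don't use this feature in the rest of the lab), a rough manufacturer feature could be derived from the first word of each car name; the raw names would likely need some cleaning (misspellings, abbreviations) before such a feature was usable:


In [ ]:
# Illustrative only: extract the first word of each car name as a rough manufacturer label
# (not used in the rest of this lab)
df['car name'].str.split().str[0].value_counts().head()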


In [ ]:
df = df.set_index('car name')

df.head()

According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. Again, for simplicity, let's just drop these from the data set:


In [ ]:
df = df[df['horsepower'] != '?']

Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because strings were present initially, the data type of the horsepower column isn't numeric:


In [ ]:
df.dtypes

We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:


In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])

# Check the data types again
df.dtypes

As can be seen, the data type of the horsepower column is now float64, i.e. a 64 bit floating point value.

According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2) and so we should encode it via one hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is use the get_dummies method, as follows:


In [ ]:
df = pd.get_dummies(df, columns=['origin'], drop_first=True)

df.head()

As can be seen, one hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a linear regression model, to avoid introducing multicollinearity, we must also drop the first of the encoded columns by setting the drop_first keyword argument to True.
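
If it isn't obvious why one of the encoded columns can be dropped, note that the full set of one hot columns always sums to one, so any one of them is a linear combination of the others (i.e. they are perfectly collinear). A small, purely illustrative example:


In [ ]:
# Illustrative only: encode a toy 'origin' series with and without drop_first
toy = pd.Series([1, 2, 3, 1, 2], name='origin')

print(pd.get_dummies(toy, prefix='origin'))                   # three columns that always sum to one
print(pd.get_dummies(toy, prefix='origin', drop_first=True))  # origin = 1 is implied by all zeros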

Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:


In [ ]:
df.describe()

Next, let's print a matrix of pairwise Pearson correlation values:


In [ ]:
df.corr()

Let's also create a scatter plot matrix:


In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));

Based on the above information, we can conclude the following:

  • Based on a quick visual inspection, there don't appear to be significant numbers of outliers in the data set. (We could make boxplots for each variable - but let's save time and skip it here.)
  • Most of the explanatory variables appear to have a non-linear relationship with the target.
  • There is a high degree of correlation ($r > 0.9$) between cylinders and displacement and, also, between weight and displacement.
  • The following variables appear to be right-skewed (i.e. they have long right tails): mpg, displacement, horsepower, weight.
  • The acceleration variable appears to be normally distributed.
  • The model year variable follows a roughly uniform distribution.
  • The cylinders and origin variables have few unique values.

For now, we'll just note this information, but we'll come back to it later when improving our model.
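
If you'd like to double-check some of these observations before moving on, the skewness of each variable and per-variable boxplots are quick to compute (this is optional):


In [ ]:
# Optional checks: positive skewness indicates a long right tail; the boxplots give
# a quick visual check for outliers
cols = ['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']

print(df[cols].skew())

df[cols].plot(kind='box', subplots=True, layout=(2, 3), figsize=(12, 8));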

Data Modelling

Dummy model

Let's start our analysis by building a dummy regression model that makes very naive (often incorrect) predictions about the target variable. This is a good first step as it gives us a benchmark to compare our later models to.

Creating a dummy regression model with scikit-learn is easy: first, we create an instance of DummyRegressor, and then we evaluate its performance on the data using cross validation, just like last week.

Note: Our dummy model has no hyperparameters, so we don't need to do an inner cross validation or grid search - just the outer cross validation to estimate the model accuracy.


In [ ]:
X = df.drop('mpg', axis='columns')  # X = features
y = df['mpg']                       # y = prediction target

model = DummyRegressor()  # By default, predicts the mean of the training target values

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 5 fold cross validation
y_pred = cross_val_predict(model, X, y, cv=outer_cv)        # Make predictions via cross validation

print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the dummy model',
    xlabel='Error'
);

The dummy model predicts the MPG with an average error of approximately $\pm 6.57$ (although, as can be seen from the distribution of errors, the spread is much larger than this). Let's see if we can do better with a linear regression model.

Linear regression model

scikit-learn supports linear regression via its linear_model subpackage. This subpackage supports least squares regression, lasso regression and ridge regression, as well as many other varieties. Let's use least squares to build our model. We can do this using the LinearRegression class, which supports the following options:

  • fit_intercept: If True, prepend an all-ones predictor to the feature matrix before fitting the regression model; otherwise, use the feature matrix as is. By default, this is True if not specified.
  • normalize: If True, rescale the input features (by subtracting the mean and scaling to unit norm, which is similar in spirit to standardization) before fitting the regression model; otherwise, use the unscaled features. By default, this is False if not specified.

Generally, it makes sense to fit an intercept term when building regression models, the exception being in cases where it is known that the target variable is zero when all the feature values are zero. In our case, it seems unlikely that an all-zero feature vector would correspond to a zero MPG target value (for instance, consider the meaning of model year = 0 and weight = 0 in the context of the analysis). Consequently, we can set fit_intercept=True below.
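
For reference, the fitted model has the form $\hat{y} = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$, where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the feature coefficients; least squares chooses these values so as to minimise the sum of squared errors $\sum_i (y_i - \hat{y}_i)^2$ on the training data. Setting fit_intercept=False simply forces $\beta_0 = 0$.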

Whether to standardize the input features or not depends on a number of factors:

  • Standardization can help to mitigate multicollinearity - but only in cases where supplemental new features have been generated from one or more existing features, i.e. where both the new feature and the features it was derived from are all included as input features.
  • Standardizing the input data ensures that the resulting model coefficients indicate the relative importance of their corresponding feature - but only in cases where the features are all approximately normally distributed.

In our case, as we are not generating supplemental new features and several of the features are not normally distributed (see the scatter plot matrix above), we lose nothing by choosing not to standardize them (normalize=False).

Note: In cases where there is uncertainty as to whether an intercept should be fit or not, or whether the input features should be standardized or not, or both, we can use a grid search with nested cross validation (i.e. model selection) to determine the correct answer.
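
For instance, a minimal sketch of such a grid search might look like the following (this is optional - the rest of the lab doesn't depend on it, and the candidate values are simply the two boolean options for each argument):


In [ ]:
# Optional sketch: nested cross validation over the LinearRegression keyword arguments
# Note: newer versions of scikit-learn have removed the 'normalize' argument - if so,
# drop it from the grid
param_grid = {'fit_intercept': [True, False], 'normalize': [True, False]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

gs = GridSearchCV(LinearRegression(), param_grid, cv=inner_cv,
                  scoring='neg_mean_absolute_error')
y_pred = cross_val_predict(gs, X, y, cv=outer_cv)

print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))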


In [ ]:
X = df.drop('mpg', axis='columns')  # X = features
y = df['mpg']                       # y = prediction target

model = LinearRegression(fit_intercept=True, normalize=False)  # Use least squares linear regression

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 5-fold cross validation
y_pred = cross_val_predict(model, X, y, cv=outer_cv)        # Make predictions via cross validation

print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the linear regression model',
    xlabel='Error'
);

Our linear regression model predicts the MPG with an average error of approximately $\pm 2.59$ and a significantly smaller standard deviation too - this is a big improvement over the dummy model!

But we can do better! Earlier, we noted that several of the features had non-linear relationships with the target variable - if we could transform these variables, we might be able to make this relationship more linear. Let's consider the displacement, horsepower and weight variables:


In [ ]:
pd.plotting.scatter_matrix(df[['mpg', 'displacement', 'horsepower', 'weight']], s=50, figsize=(9, 9));

The relationship between the target and these predictors appears to be an exponentially decreasing one: as the predictors increase in value, there is an exponential decrease in the target value. Log-transforming the variables should help to remove this effect (logarithms are the inverse mathematical operation to exponentials):


In [ ]:
df['displacement'] = df['displacement'].map(np.log)
df['horsepower'] = df['horsepower'].map(np.log)
df['weight'] = df['weight'].map(np.log)

Now, the relationship between the predictors and the target is much more linear:


In [ ]:
pd.plotting.scatter_matrix(df[['mpg', 'displacement', 'horsepower', 'weight']], s=50, figsize=(9, 9));

Let's run the analysis a second time and see the effect this has had:


In [ ]:
X = df.drop('mpg', axis='columns')
y = df['mpg']

model = LinearRegression(fit_intercept=True, normalize=False)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=outer_cv)

print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the linear regression model with transformed features',
    xlabel='Error'
);

As can be seen, the average error has now decreased to $\pm 2.33$ and the standard deviation of the error to 3.12. Further reductions in error might be achieved by experimenting with feature selection, given the high degree of correlation between some of the predictors, or with a more sophisticated model, such as ridge regression.
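
As an optional experiment, here is a rough sketch of what trying ridge regression might look like (the candidate alpha values below are arbitrary examples, not tuned recommendations):


In [ ]:
# Optional sketch: ridge regression, with the regularisation strength (alpha) chosen
# by an inner grid search - the candidate alphas are arbitrary examples
from sklearn.linear_model import Ridge

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

ridge = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                     cv=inner_cv, scoring='neg_mean_absolute_error')
y_pred = cross_val_predict(ridge, X, y, cv=outer_cv)

print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))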

Building the final model

Once we have identified an approach that satisfies our requirements (e.g. accuracy), we should build a final model using all of the data.


In [ ]:
X = df.drop('mpg', axis='columns')
y = df['mpg']

model = LinearRegression(fit_intercept=True, normalize=False)
model.fit(X, y)  # Fit the model using all of the data

We can examine the values of the intercept (if we chose to fit one) and coefficients of our final model by printing its intercept_ and coef_ attributes, as follows:


In [ ]:
print(model.intercept_)
print(model.coef_)  # Coefficients are printed in the same order as the columns in the feature matrix, X
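
Finally, recalling the lab objectives, we can use the fitted model to predict MPG values for new observations via its predict method. Any new data must have the same columns and preprocessing as the training data (the one hot encoding and log transforms above); as a quick sanity check, here are predictions for the first few rows of our existing feature matrix:


In [ ]:
# Predict the MPG of the first five cars in the (already preprocessed) feature matrix
model.predict(X.head())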